AI safety

Mechanistic Interpretability for AI Safety -- A Review

https://arxiv.org/abs/2404.14082

Achieving Safety in Large Language Models and Future Directions

https://llmc.nii.ac.jp/wp-content/uploads/2024/10/20240925_t4_sekine.pdf

Robust Intelligence

https://www.robustintelligence.com/

Citadel AI

https://www.citadel.co.jp/

On the Couple Sitting Next to Me at a Beef Tongue Restaurant in Shibuya, and Deduction and Induction in AI Development

https://storialaw.jp/blog/4532

ChatGPT vs BERT: Which Understands Japanese Better?

https://fintan.jp/page/9126/

Japanese Evaluation Results for Open-Source LLMs - Reproducible by Anyone with W&B Launch

https://note.com/wandb_jp/n/n2464e3d85c1a

lm-evaluation-harness

https://github.com/EleutherAI/lm-evaluation-harness

95th Machine Learning 15minutes! Hybrid (clip)

https://www.youtube.com/watch?v=w8M7DRVOR54

"The Need for AI Safety, Concrete Attacks, and Countermeasures" Matsuo Lab LLM Community "Paper & Hacks Vol.30"

https://www.youtube.com/watch?v=ji1G90kUel8

HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection

https://arxiv.org/abs/2409.17504

LLM Guard - The Security Toolkit for LLM Interactions

https://github.com/protectai/llm-guard

HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

https://arxiv.org/abs/2501.08292

GuardReasoner: Towards Reasoning-based LLM Safeguards

https://arxiv.org/abs/2501.18492

Using OpenAI's Moderation API to Protect an AI Girlfriend from Sexual Abuse

https://pixel-freak.com/blog/openai-moderation-api

Continual Pre-training in Practice with NeMo Framework: Japanese LLM Edition

https://developer.nvidia.com/ja-jp/blog/how-to-use-continual-pre-training-with-japanese-language-on-nemo-framework

COLING 2025 Tutorial: Safety Issues for Generative AI

https://librairesearch.github.io/tutorial/static/slides/Safety_Issues_of_GenAI.pdf

https://librairesearch.github.io/tutorial/index.html

AI Safety Annual Report 2024

https://aisi.go.jp/effort/effort_information/250207_3/

OpenAI's Moderation API

https://weel.co.jp/media/moderation-api/

About OWASP (Open Web Application Security Project)

https://zenn.dev/mukkun69n/articles/3f6e689d3cfa87

I Built AILBREAK, a Game Where You Can Play with Jailbreaks

https://note.com/schroneko/n/n3c8ce016a38b

ASI existential risk: reconsidering alignment as a goal

https://michaelnotebook.com/xriskbrief/index.html

Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

https://arxiv.org/abs/2504.15585

Start your Trustworthy AI Development with Safety Leaderboards in Azure AI Foundry

https://techcommunity.microsoft.com/blog/aiplatformblog/start-your-trustworthy-ai-development-with-safety-leaderboards-in-azure-ai-found/4425165

How much novel security-critical infrastructure do you need during the singularity?

https://www.alignmentforum.org/posts/qKz2hBahahmb4uDty/how-much-novel-security-critical-infrastructure-do-you-need

An Approach to Technical AGI Safety and Security

https://arxiv.org/abs/2504.01849

Six Thoughts On AI Safety

https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/

From Shift Left to Shift Up: Securing the New AI Abstraction Layer

https://www.pillar.security/blog/from-shift-left-to-shift-up-securing-the-new-ai-abstraction-layer

Building and evaluating alignment auditing agents

https://alignment.anthropic.com/2025/automated-auditing/

Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium

https://www.alignmentforum.org/posts/524pFXTPD8iDWmX4x/technical-acceleration-methods-for-ai-safety-summary-from

First Key Update: Capabilities and Risk Implications

https://internationalaisafetyreport.org/publication/first-key-update-capabilities-and-risk-implications

Reward Hacking Research Update: Interim report on ongoing work on reward hacking

https://blog.eleuther.ai/reward_hacking/

Future of Life Institute

https://futureoflife.org/

MIRI

https://intelligence.org/